An Exponential Tail Bound for Lq Stable Learning Rules. Application to k-Folds Cross-Validation

Authors

  • Karim T. Abou-Moustafa
  • Csaba Szepesvári
Abstract

We consider a priori generalization bounds developed in terms of cross-validation estimates and the stability of learners. In particular, we first derive an exponential Efron-Stein type tail inequality for the concentration of a general function of n independent random variables. Next, under some reasonable notion of stability, we use this exponential tail bound to analyze the concentration of the k-fold cross-validation (KFCV) estimate around the true risk of a hypothesis generated by a general learning rule. While the accumulated literature has often attributed this concentration to the bias and variance of the estimator, our bound attributes this concentration to the stability of the learning rule and the number of folds k. This insight raises valid concerns related to the practical use of KFCV, and suggests research directions to obtain reliable empirical estimates of the actual risk.

k-Folds cross-validation (KFCV) is a widely used procedure for estimating the empirical risk of a hypothesis obtained from a certain learning rule (Stone 1974; Geisser 1975). It is used in practice with the promise of being more accurate than the training error, while not being as computationally expensive as the deleted (or leave-one-out) estimate, which is considered an unbiased estimate of the actual risk (under some notion of stability of the learning rule) (Devroye, Györfi, and Lugosi 1996; Blum, Kalai, and Langford 1999). As such, it is natural to ask how well the KFCV estimate concentrates around the risk of the hypothesis returned by the sought learning rule. Various works have considered different aspects of this question. Blum, Kalai, and Langford (1999) show that the KFCV estimate is more accurate than the training error based on its variance and higher order moments.
Kale, Kumar, and Vassilvitskii (2011), under some notion of stability, show that the averaging taking place in the KFCV estimate leads to a tighter concentration of the estimated risk around its expectation. Note that this is different from considering the concentration of the estimated risk around the actual risk of the hypothesis. Cornec (2017), in the spirit of sanity-check bounds (Kearns and Ron 1999), shows that for empirical risk minimizers over VC classes, the worst-case error of the KFCV estimate is not much worse than that of the training error.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In this work we consider the exponential concentration of the KFCV estimate around the actual risk of a hypothesis returned by a stable learning rule, under distribution-dependent notions of stability. Our hope is to obtain a high probability generalization bound for the KFCV estimate without depending on overly restrictive notions of stability such as uniform stability (explained below) (Kutin and Niyogi 2002). Earlier works have derived such concentration results for the deleted estimate and learning rules that are uniformly stable, in the sense that no matter how the input to the learning rule is selected, and no matter what value is used as a test example, replacing (or removing) one example in the input changes the prediction loss only in a limited fashion (Bousquet and Elisseeff 2002). The stability coefficient of a learning rule is the amount of this change. Bousquet and Elisseeff (2002) considered the concentration of the deleted estimate and the resubstitution estimate around the (random) risk of a hypothesis returned by a uniformly stable learning rule. The main observation of Bousquet and Elisseeff (2002) is that uniform stability (a worst-case notion over all training and test examples) allows an elegant use of McDiarmid's inequality, which leads to exponential tail bounds.
Kutin and Niyogi (2002) and Rakhlin, Mukherjee, and Poggio (2005) consider a softening of the stringent requirement underlying uniform stability to "almost everywhere" stability. While Kutin and Niyogi (2002) prove their result by extending McDiarmid's inequality, Rakhlin, Mukherjee, and Poggio (2005) use the higher-moment version of the Efron-Stein inequality due to Boucheron, Lugosi, and Massart (2003). Uniform stability is unpleasantly restrictive: Unlike other notions of stability (e.g., L1 or L2 stability), it is insensitive to the data-generating distribution. This is problematic as it removes the possibility of studying large classes of learning rules, or even classes of problems. One particularly striking example is binary classification with the zero-one loss (Kutin and Niyogi 2002). Another example where uniform stability fails is regression with unbounded response variables and losses. In addition, as noted earlier, uniform stability is distribution-free and is thus unsuitable for studying the finer details of learning. Since we are interested in the tail properties of KFCV, and higher moments are sufficient and necessary to characterize the tails of random variables, it is natural to expect that the whole family of Lq-stability coefficients with q ≥ 1 plays a role in determining the tail behavior of KFCV. The advantage of Lq stability coefficients over uniform stability (which in a way is close to an L∞ coefficient) is that they are distribution dependent and are nontrivial even when the uniform stability coefficient is uncontrolled. Recent, yet unpublished, work by Celisse and Guedj (2016) indeed demonstrated that the family of Lq stability coefficients can be successfully used to study the deviation of the deleted estimate. While we also use the same family of stability coefficients, our work goes beyond that of Celisse and Guedj (2016) in that we consider distribution-dependent concentration bounds for the KFCV estimate.
While our techniques resemble those of Celisse and Guedj (2016), we streamline several steps of their proofs. One difference is that we build directly on the elegant Efron-Stein style exponential inequality of Boucheron, Lugosi, and Massart (2003), while Celisse and Guedj (2016) chose a different route.

1 Setup and Notations

We consider learning in Vapnik's framework for risk minimization with bounded losses (Vapnik 1995): A learning problem is specified by the triplet (H, X, ℓ), where H, X are sets and ℓ : H × X → [0, 1]. The set H is called the hypothesis space, X is called the instance space, and ℓ is called the loss function. The loss ℓ(h, x) indicates how well a hypothesis h ∈ H explains (or fits) an instance x ∈ X. The learning problem is defined as follows: A learner A sees a sample in the form of a sequence Sn = (X1, . . . , Xn) ∈ X^n, where (Xi)i is sampled in an independent and identically distributed (i.i.d.) fashion from some unknown distribution P, and returns a hypothesis ĥn = A(Sn) ∈ H based solely on X1, . . . , Xn. The goal of the learner is to pick hypotheses with a small risk (defined shortly). For readers familiar with learning theory we remark that, as opposed to most of statistical learning theory, the only role H plays is to collect the universe of all choices available to learning rules. In particular, unlike in most of the literature on statistical learning theory, it will not be used to "control the bias of learners". We assume that a learner is able to process samples of different cardinality. Hence, a learner will be identified with a map A : ∪n X^n → H. Here, we only consider deterministic learning rules; the extension to randomized learning rules is left for future work. Given a distribution P on X, and X ∼ P, the risk of a fixed hypothesis h ∈ H is given by R(h, P) = E[ℓ(h, X)]. Since Sn is random, so is A(Sn). Therefore, we define the risk of the hypothesis A(Sn) by R(A(Sn), P) = E[ℓ(A(Sn), X) | Sn].
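To ground these definitions, the following minimal sketch approximates the risk R(A(Sn), P) by Monte Carlo in a toy one-dimensional setup; the Gaussian distribution, the sample-mean learning rule, and the clipped squared loss are all illustrative assumptions made here, not choices from the paper.

```python
import random

def loss(h, x):
    """Bounded loss ell(h, x) in [0, 1]: a clipped squared error between
    a scalar hypothesis h and an instance x (an illustrative choice)."""
    return min(1.0, (h - x) ** 2)

def learn(sample):
    """A toy deterministic learning rule A: return the sample mean."""
    return sum(sample) / len(sample)

def risk(h, draw, n_mc=100_000, seed=0):
    """Monte Carlo approximation of R(h, P) = E[ell(h, X)], X ~ P."""
    rng = random.Random(seed)
    return sum(loss(h, draw(rng)) for _ in range(n_mc)) / n_mc

rng = random.Random(1)
draw = lambda r: r.gauss(0.0, 0.5)    # the unknown distribution P
S_n = [draw(rng) for _ in range(50)]  # the i.i.d. sample S_n
h_hat = learn(S_n)                    # the hypothesis A(S_n)
R_hat = risk(h_hat, draw)             # approximates the random risk R(A(S_n), P)
```

Since A(Sn) depends on the random sample, rerunning this with a different sample seed gives a different value of R_hat, which is exactly why the risk R(A(Sn), P) is itself a random quantity.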
Note that R(A(Sn), P) is also a random quantity. Ideal learners keep the risk R(A(Sn), P) of the hypothesis returned by A small for a wide range of distributions P.

q-Norm of RVs

In the sequel we will heavily rely on the q-norm of a random variable (RV). For a real RV X and 1 ≤ q < +∞, the q-norm of X is defined as ‖X‖q = (E[|X|^q])^(1/q), and ‖X‖∞ is the essential supremum of |X|. (In general, for the sake of minimizing clutter, we will skip mentioning measurability issues; in particular, all sets and functions are assumed to be measurable as needed.) Note that for 1 ≤ q ≤ p ≤ +∞, the q-norm satisfies ‖·‖q ≤ ‖·‖p.

1.1 Quality Assessment of Learners

Most of statistical learning theory is devoted to answering the following two questions: (i) A posteriori performance assessment: How well did A work on some data Sn drawn from some distribution P? (ii) A priori performance prediction: How well will A perform on data Sn that will be drawn from some distribution P? For both questions, the answer should be given in terms of the risk R(A(Sn), P) of the hypothesis A(Sn). Since Sn and A(Sn) are random quantities, in general the answers to the above questions will be upper bounds, so-called generalization bounds, on the random risk R(A(Sn), P) that have a probabilistic nature; i.e., the bounds hold with high probability, or hold for the expected risk Rn(A, P) = E[ℓ(A(Sn), X)], or for the higher moments of the risk. The two questions are similar in that both concern performance on unseen data (since the definition of the risk involves future unseen data). As a result, the questions are often answered using similar tools. The two questions are also fundamentally different: in the case of the first question, the data Sn that produces the hypothesis A(Sn) is already given, while in the second case the data is yet unknown at the time the question is asked.
Correspondingly, we call bounds answering the first question a posteriori ("after the fact") bounds, and bounds answering the second question a priori bounds. Ideal a posteriori bounds depend on both A and Sn (i.e., these bounds should be learner- and data-dependent), while in the case of a priori bounds, the bound can at best depend on A and P (i.e., they can be learner- and distribution-dependent). In this paper we consider the second question, i.e., a priori generalization bounds. In particular, we consider a priori generalization bounds developed in terms of cross-validation estimates and the stability of learners.

2 Efron-Stein Concentration Inequalities

The main tool for our work is an extension of the Efron-Stein inequality (Efron and Stein 1981; Steele 1986) to a stronger version known as the exponential Efron-Stein inequality (Boucheron, Lugosi, and Massart 2003). The Efron-Stein inequality is a strong tool in itself to bound the variance V[Z] = E[(Z − EZ)²] of a random variable Z which is a function (call this f) of a number of independent RVs. The idea of the Efron-Stein inequality is to "decompose" the variance into a sum V of variance-like terms that express the sensitivity of the function f to its individual variables in an appropriate manner. Oftentimes, these individual sensitivities are easier to control than the variance directly. The crucial feature of the inequality is that it avoids pessimistic worst-case bounds like those that underlie McDiarmid's inequality (McDiarmid 1989). While bounding the variance itself is crucial, we will need exponential concentration bounds on the tails of Z. Such bounds were derived in the work of Boucheron, Lugosi, and Massart (2003) and Boucheron, Lugosi, and Massart (2013). Here, based on the techniques developed in this groundbreaking work, we derive a new tail inequality which will better suit our purposes. We start by introducing the Efron-Stein inequality and some variations.
The inequalities shown here will be useful in our derivations on their own. Let f : X^n → R be a real-valued function of n variables, where X is a measurable space (not necessarily the same as in the previous section). If X1, . . . , Xn are independent (not necessarily identically distributed) RVs taking values in X, define the RV Z = f(X1, . . . , Xn) ≡ f(Sn). Define the shorthand for the conditional expectation E−iZ = E[Z | Sn−i], where Sn−i = (X1, . . . , Xi−1, Xi+1, . . . , Xn); i.e., it is the sequence Sn with example Xi removed. Informally, E−iZ "integrates" Z over Xi and over any other source of randomness in Z except Sn−i. The celebrated Efron-Stein inequality bounds the variance of Z as shown in the following theorem:

Theorem 1 (Efron-Stein Inequality). Let V = Σ_{i=1}^n (Z − E−iZ)². Under the setting described in this section, it holds that V[Z] ≤ EV.

The proof of Theorem 1 can be found in (Boucheron, Lugosi, and Massart 2004). Another variant of the Efron-Stein inequality, which will turn out to be more useful in our context, is concerned with the removal of one example from Sn. To state the result, let fi : X^{n−1} → R, for 1 ≤ i ≤ n, be arbitrary measurable functions, and define the RVs Z−i = fi(Sn−i). Then the Efron-Stein inequality can also be stated in the following interesting form (Boucheron, Lugosi, and Massart 2004, Theorem 6):

Corollary 1 (Efron-Stein Inequality – Removal Case). Assume that E−i[Z−i] exists for all 1 ≤ i ≤ n, and let VDEL = Σ_{i=1}^n (Z − Z−i)². Then it holds that

V[Z] ≤ EV ≤ EVDEL. (1)

It may be surprising at first sight that V[Z] can be bounded in terms of VDEL, which relies on the arbitrary functions fi unrelated to f. The proof in (Boucheron, Lugosi, and Massart 2004) reveals that there is no mistake here.
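As a numeric sanity check of Corollary 1, the sketch below estimates V[Z] and E[VDEL] by simulation for the simplest choice of Z, the sample mean, with each fi taken to be f applied to the reduced sample; the sample size, trial count, and distribution are all illustrative assumptions.

```python
import random

def f(sample):
    """Z = f(X_1, ..., X_n); the sample mean, purely for illustration."""
    return sum(sample) / len(sample)

def v_del(sample):
    """V_DEL = sum_i (Z - Z_{-i})^2, where Z_{-i} applies f to the
    sample with example i removed."""
    z = f(sample)
    return sum((z - f(sample[:i] + sample[i + 1:])) ** 2
               for i in range(len(sample)))

rng = random.Random(0)
n, trials = 10, 10_000
zs, vdels = [], []
for _ in range(trials):
    s = [rng.gauss(0.0, 1.0) for _ in range(n)]
    zs.append(f(s))
    vdels.append(v_del(s))

mean_z = sum(zs) / trials
var_z = sum((z - mean_z) ** 2 for z in zs) / trials  # empirical V[Z]
e_vdel = sum(vdels) / trials                         # empirical E[V_DEL]
```

For the sample mean of n standard Gaussians, V[Z] = 1/n while E[VDEL] works out to 1/(n − 1), so the removal bound (1) holds with a visible but modest gap.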
2.1 An Exponential Efron-Stein Inequality

The work of Boucheron, Lugosi, and Massart (2003) focused on controlling the tail of general functions of independent RVs in terms of the tail behavior of Efron-Stein variance terms such as V and VDEL, as well as other variance terms known as V+ and V− (Boucheron, Lugosi, and Massart 2013). The latter variance terms will not be presented here since they do not serve our purpose. The tail of a RV is often controlled through bounding the logarithm of the moment generating function (MGF) of the RV. This is known as the cumulant generating function (CGF) of the RV and is defined as ψZ(λ) = log E[exp(λ(Z − EZ))], where λ ∈ dom(ψZ) ⊂ R belongs to a suitable neighborhood of zero. The main result of Boucheron, Lugosi, and Massart (2003) bounds ψZ in terms of the MGF of V, V+ and V−, but not in terms of the MGF of VDEL. Since we are particularly interested in the RV VDEL, the following theorem bounds the tail of Z in terms of the MGF of VDEL. The proof is given in the Appendix.

Theorem 2. Let Z = f(X1, . . . , Xn) be a real-valued function of n independent RVs. For all θ > 0 and λ ∈ (0, 1] with θλ < 1 and E[exp(λθ VDEL)] < ∞, the following holds:

log E[exp(−λ(Z − EZ))] ≤ λθ(1 − λθ)^{−1} log E[exp(λθ VDEL)]. (2)

Theorem 2 states that the CGF of the centered RV Z is upper bounded by the CGF of the RV VDEL. Hence, when VDEL behaves "nicely", the tail of Z can be controlled. The value of θ in the upper bound is a free parameter that can be optimized. For Theorem 2 to be useful in our context, further control is required to upper bound the tail of VDEL. Our approach to controlling the tail of VDEL will, again, be through its CGF. In particular, we will show that when VDEL is a sub-gamma RV (defined shortly) we can obtain a high probability tail bound on the deviation of the RV Z. The obtained tail bound will be instrumental in deriving the exponential tail bound for the KFCV estimate.
Sub-Gamma RVs

A real-valued centered RV X is said to be sub-gamma on the right tail with variance factor v and scale parameter c if for every λ such that 0 < λ < 1/c the following holds:

ψX(λ) ≤ λ²v / (2(1 − cλ)). (3)

This is denoted by X ∈ Γ+(v, c). Similarly, X is said to be sub-gamma on the left tail with variance factor v and scale parameter c if −X ∈ Γ+(v, c); this is denoted by X ∈ Γ−(v, c). Finally, X is simply a sub-gamma RV with variance factor v and scale parameter c if X ∈ Γ+(v, c) and X ∈ Γ−(v, c); this is denoted by X ∈ Γ(v, c). The sub-gamma property can be characterized in terms of tail behavior or moment conditions, as follows from Theorem 2.3 of (Boucheron, Lugosi, and Massart 2013):

Theorem 3. Let X be a centered RV. If for some v > 0 and c ≥ 0

P[X > √(2vt) + ct] ∨ P[−X > √(2vt) + ct] ≤ e^{−t} (4)

for every t > 0, then for every integer q ≥ 1

‖X‖2q ≤ (q! A^q + (2q)! B^{2q})^{1/(2q)} ≤ √(16.8qv) ∨ 9.6qc ≤ 10(√(qv) ∨ qc),

where A = 8v, B = 4c, and x ∨ y = max(x, y). Conversely, if for some positive constants u and w, for every integer q ≥ 1, ‖X‖2q ≤ √(qu) ∨ qw, then (4) holds with v = 4(1.1u + 0.73w²) and c = 1.46w.

The reader may notice that Theorem 3 is slightly different from the original version in (Boucheron, Lugosi, and Massart 2013). Our extension of the main result of Boucheron, Lugosi, and Massart (2013) is based on simple calculations that are merely for convenience with respect to our purpose.

2.2 An Exponential Tail Bound for Z

In this section we assume that VDEL is a sub-gamma RV with variance factor v > 0 and scale parameter c ≥ 0, and that cλ < 1. Hence, from (3) it holds that

ψ_{VDEL − EVDEL}(λ) = log E[exp(λ(VDEL − EVDEL))] ≤ λ²v / (2(1 − cλ)).

The sub-gamma property of VDEL provides the desired control on its tail. That is, after rearranging the terms of the above inequality, the CGF of VDEL, which controls the tail of VDEL, is upper bounded by the deterministic quantities EVDEL, the variance factor v, and the scale parameter c.
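A centered Gamma random variable is the canonical example satisfying definition (3): for G with shape a and scale c, the centered RV X = G − E[G] has CGF ψX(λ) = −a log(1 − cλ) − acλ for 0 < λ < 1/c, and is sub-gamma on the right tail with v = ac² (the variance of G) and scale c. The sketch below checks the CGF bound numerically over a grid of λ; the shape and scale values are arbitrary.

```python
import math

a, c = 2.0, 0.5        # shape and scale of the Gamma RV (arbitrary choices)
v = a * c * c          # variance factor: Var[G] = a * c^2

for i in range(1, 199):
    lam = i / 100.0                                   # sweep 0 < lam < 1/c = 2
    psi = -a * math.log(1.0 - c * lam) - a * c * lam  # exact CGF of G - E[G]
    bound = lam * lam * v / (2.0 * (1.0 - c * lam))   # right-hand side of (3)
    assert psi <= bound + 1e-12, (lam, psi, bound)
```

The check passes for every λ on the grid, reflecting the elementary inequality −log(1 − u) − u ≤ u²/(2(1 − u)) for 0 < u < 1 applied with u = cλ.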
Therefore, it is now possible to use the sub-gamma property of VDEL to extend the exponential Efron-Stein inequality of Theorem 2. In particular, the following lemma gives an exponential tail bound on the deviation of a function of independent RVs, Z = f(X1, . . . , Xn), in terms of EVDEL, the variance factor v, and the scale parameter c. This lemma will be our main tool to derive the exponential tail bound for the KFCV estimate. The proof is given in the Appendix.

Lemma 1. Let the RVs Z, Z−i, and VDEL be defined as above. If VDEL − EVDEL is a sub-gamma RV with variance factor v > 0 and scale parameter c ≥ 0, then for δ ∈ (0, 1) and a > 0, with probability at least 1 − δ,

|Z − EZ| ≤ (4/3)(ac + 1/a) log(2/δ) + 4 √((EVDEL + av/2) log(2/δ)).

The parameter a in the upper bound is a free parameter that can be optimized to provide the tightest possible bound; a typical choice of a would be the inverse standard deviation of Z. Lemma 1 is our first contribution in this work: recalling the definition of the RV Z (a function of n independent RVs), Lemma 1 gives an exponential tail bound on the deviation of Z from its expectation by controlling the tails of its variance-like components Z−i, and hence of VDEL, which in turn is a sub-gamma RV with bounded higher order moments. In our second contribution, we will use Lemma 1 to develop a high probability generalization bound for the KFCV estimate (which will play the role of the RV Z) in terms of the "stability" of the learning rule. Due to the definition of VDEL, stability of the learning rule will turn out to be instrumental in bounding the higher order moments of VDEL, and hence in upper bounding the deviation of the KFCV estimate. However, to derive the desired bound, it remains to formally define the KFCV estimate and the notion of stability that will permit us to derive such a high probability bound. We pursue this in the following two sections.
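To see how the free parameter a trades the two terms of Lemma 1 against each other, the following sketch evaluates the right-hand side of the bound over a grid of values of a; the numbers chosen for E[VDEL], v, c, and δ are hypothetical placeholders rather than quantities derived from any analysis.

```python
import math

def lemma1_bound(e_vdel, v, c, delta, a):
    """Right-hand side of Lemma 1 for a given free parameter a > 0.
    The inputs e_vdel, v, c would come from a stability analysis; the
    values used below are purely hypothetical."""
    log_term = math.log(2.0 / delta)
    return ((4.0 / 3.0) * (a * c + 1.0 / a) * log_term
            + 4.0 * math.sqrt((e_vdel + a * v / 2.0) * log_term))

# Grid-search the free parameter a, as suggested after the lemma.
e_vdel, v, c, delta = 0.05, 0.01, 0.1, 0.05
grid = [10 ** (k / 10.0) for k in range(-20, 21)]   # a in [0.01, 100]
best_a = min(grid, key=lambda a: lemma1_bound(e_vdel, v, c, delta, a))
best = lemma1_bound(e_vdel, v, c, delta, best_a)
```

The first term blows up as a → 0 (through 1/a) and grows linearly in a (through ac and av/2), so an intermediate a minimizes the bound, consistent with the suggestion of taking a near the inverse standard deviation of Z.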
3 Risk Estimators

Generalization bounds on the risk usually center on some point estimate of the random risk R(A(Sn), P). Many estimators are based on calculating the sample mean of losses in one form or another. For any fixed hypothesis h ∈ H we define the empirical risk of h on Sn as R̂(h, Sn) = (1/n) Σ_{i=1}^n ℓ(h, Xi). Plugging A(Sn) into R̂(·, Sn) we get the training error or resubstitution (RES) estimate (Devroye and Wagner 1979): R̂RES(A, Sn) = R̂(A(Sn), Sn). The resubstitution estimate is often overly "optimistic", i.e., it underestimates the actual risk R(A(Sn), P). The leave-one-out or deleted (DEL) estimate (Devroye and Wagner 1979) is a common alternative to the resubstitution estimate that aims to correct for this: R̂DEL(A, Sn) = (1/n) Σ_{i=1}^n ℓ(A(Sn−i), Xi), where Sn−i is defined as in the previous section. Since E[ℓ(A(Sn−i), Xi)] = Rn−1(A, P), we have E[R̂DEL(A, Sn)] = Rn−1(A, P). When the latter is close to Rn(A, P), i.e., when A is "stable", the deleted estimate may be a good alternative to the resubstitution estimate. However, due to the potentially strong correlations between the elements of (ℓ(A(Sn−i), Xi))i, the variance of the deleted estimate is expected to be higher than that of the resubstitution estimate (there is much redundancy in the information content of ℓ(A(Sn−i), Xi) and ℓ(A(Sn−j), Xj) for i ≠ j). Another downside of the deleted estimate is its high computational cost: to evaluate R̂DEL(A, Sn), one has to execute the learner A on Sn−i to obtain hypothesis ĥi for i = 1, . . . , n, i.e., execute A n times. For large n, this is indeed prohibitive. The KFCV estimate provides a way of naturally interpolating between the resubstitution and the deleted estimates (Stone 1974; Geisser 1975). For simplicity, assume that the sequence Sn can be partitioned into k equal folds F1, . . . , Fk, where each fold Fj is a sequence that has exactly m examples from Sn; i.e., Sn = (F1 F2 . . . Fk).
In particular, we assume that n = mk. This assumption is merely for convenience: all of our results extend to the general case with some extra effort. KFCV proceeds by learning k hypotheses ĥ1, . . . , ĥk, where ĥj = A(Sn−[Fj]) and Sn−[Fj] is the sequence (F1 . . . Fj−1 Fj+1 . . . Fk), i.e., Sn with fold Fj removed. The empirical risk of ĥj is obtained by evaluating ĥj on Fj, which was "held out" while running A on Sn−[Fj]. The KFCV estimate for the risk is the average of the empirical risks of the k hypotheses:

R̂KFCV(A, Sn) = (1/k) Σ_{j=1}^k R̂(ĥj, Fj).
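The procedure just described can be sketched as follows, reusing a toy sample-mean learner and a clipped squared loss (both illustrative assumptions, not the paper's); note that taking k = n recovers the deleted estimate, while smaller k reduces the number of runs of A to k.

```python
import random

def loss(h, x):
    """Bounded loss in [0, 1]: clipped squared error (illustrative only)."""
    return min(1.0, (h - x) ** 2)

def learn(sample):
    """A toy deterministic learning rule A: the sample mean."""
    return sum(sample) / len(sample)

def kfcv(sample, k):
    """KFCV estimate: split S_n into k contiguous folds of m = n // k
    examples, train on the other k - 1 folds, evaluate on the held-out
    fold, and average the k empirical risks."""
    n = len(sample)
    m = n // k
    assert n == m * k, "for simplicity, assume n = m * k"
    total = 0.0
    for j in range(k):
        fold = sample[j * m:(j + 1) * m]              # F_j, held out
        rest = sample[:j * m] + sample[(j + 1) * m:]  # S_n with F_j removed
        h_j = learn(rest)                             # h_hat_j
        total += sum(loss(h_j, x) for x in fold) / m  # empirical risk on F_j
    return total / k

rng = random.Random(0)
S_n = [rng.gauss(0.0, 0.5) for _ in range(60)]
est_5fold = kfcv(S_n, k=5)
est_deleted = kfcv(S_n, k=60)   # k = n: each fold is one example
```

For a stable rule such as the sample mean, the two estimates are close, illustrating the interpolation between the resubstitution-like and deleted-like ends of the k range.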



Publication date: 2018